Current approaches to the design of combinatorial libraries assume that structural diversity in the reactant 
pools corresponds to structural diversity in the combinatorial libraries that result from
INTRODUCTION 

The last few years have seen an explosive growth in the 
use of combinatorial methods for the creation of extremely 
large libraries of structurally diverse molecules, from which 
it has proved possible to identify biologically active molecules 
far more rapidly than is possible using conventional 
approaches to drug discovery. 1-4 The effectiveness of the 
approach is crucially dependent on the building blocks, or 
reactants, that are used as the input to the combinatorial 
synthesis of the final products since there are generally far 
more reactants available than can actually be used in 
practice. 5 For example, peptoids are polymers of N- 
substituted glycine that have a peptide backbone but with 
side chains attached at the amide nitrogen instead of the 
α-carbon. 6 Peptoids are synthesized by incorporating the 
side chains from amines. Public databases of chemical 
structures, such as the Available Chemicals Directory, may 
contain many hundreds of different, readily-available amines, 
thus permitting the synthesis of libraries containing extremely 
large numbers of different molecules, even if attention is 
restricted to tri- and tetrapeptoids. 

Techniques for combinatorial synthesis have developed 
rapidly, and it is now possible to synthesize extremely large 
numbers of compounds in single combinatorial experiments. 
Combinatorial synthesis, therefore, is an efficient way of 
providing compounds for high throughput screening for the 
discovery of new leads. However, it is the rate at which 
compounds can be screened that is the limiting step in 
combinatorial chemistry. One way of increasing throughput 
is to synthesize and screen compounds as mixtures rather 
than as discrete compounds. Although this method does 
allow a large number of compounds to be screened, there 
are several problems associated with the handling of 
mixtures. These include the difficulty of assessing the 
quality of the mixtures to determine whether all intended 
compounds have actually been synthesized and the possibility 
that a positive screening result may in fact be the result of 
the synergy of several structures so that no activity is found 
when the compounds are deconvoluted and screened inde- 
pendently. The problems associated with synthesizing and 
screening mixtures are avoided by the parallel synthesis and 
testing of discrete compounds. Since a much smaller number 
of compounds can be screened, it is then necessary to use 
compound selection in order to reduce the number of 
compounds available for testing. There has, therefore, been 
much interest in techniques for the selection of sets of 
dissimilar reactants from existing chemical databases, so that 
the compounds that are generated cover a wide range of 
structural types. 5-14 It is assumed that if it is possible to 
identify maximally-diverse (or, more realistically, near 
maximally-diverse) sets of reactants, then their use will result 
in the generation of a maximally-diverse combinatorial 
library of products. If this assumption is correct, it will 
permit the full exploration of the potential structural space 
even though only a relatively small number of compounds 
are actually synthesized and tested. In what follows, the 
assumption that diversity at the reactant level reflects 
diversity at the product level is referred to as the diversity 
hypothesis. 

The assumption that a diverse set of products will result 
from a diverse set of reactants has not yet been tested 
experimentally owing to the sheer numbers of compounds 
that are involved. For example, the tripeptoid library 
investigated by Martin et al. was based on no fewer than 721 
primary amines (and 1133 carboxylic acid and acid chloride 
amino-terminal capping groups) but used only 18 members 
of the reactant pool. 6 This paper reports a quantitative 
examination of the validity of the diversity hypothesis. We 
show that the diversity hypothesis is incorrect and that 
selection of diverse reactants does not result in maximum 
diversity in product space. We then report on a more 
effective method that we have developed for selecting 
reactants by analyzing product space. While still not optimal 
in terms of maximizing a quantitative index of structural 
diversity, this approach provides a noticeably better solution 
than existing methods for the selection of reactants in order 
to maximize the diversity of combinatorial libraries. 

TESTING THE DIVERSITY HYPOTHESIS 

Theoretical Background. Consider a combinatorial 
library, $c$, that is synthesized from reactants contained in two 
reactant pools, $r_1$ and $r_2$, of sizes $n_1$ and $n_2$, respectively (in 
the following, we consider only dimer libraries for the 
purpose of simplicity). These two reactant pools have 
previously been selected as representing diverse subsets of 
two larger potential-reactant pools, $R_1$ and $R_2$, of sizes $N_1$ 
and $N_2$, respectively, using some quantitative subset-selection 
procedure. Let $C$ be the combinatorial library that would 
have been generated from all possible combinations of $R_1$ 
and $R_2$ if the subset-selection procedure had not been used. 
Thus, $c$ and $C$ contain $n_1 n_2$ and $N_1 N_2$ dimers, respectively. 
The same subset-selection procedure that was used to create 
the reactant pools $r_1$ and $r_2$ (i.e., that was used to identify 
the $n_1$ most dissimilar molecules in $R_1$ and the $n_2$ most 
dissimilar molecules in $R_2$) is used to identify the most 
dissimilar $n_1 n_2$ molecules from amongst the $N_1 N_2$ molecules 
in $C$. This subset is referred to subsequently as library $L_d^*$. 
The construction of the libraries $c$ and $L_d^*$ is illustrated in 
Figure 1. 

Let $D(X)$ be a function that returns a value describing the 
diversity of a set of molecules, $X$. Then the diversity 
hypothesis would suggest that $c$ is comparable in structural 
diversity to $L_d^*$, i.e., that 

$$D(L_d^*) = D(c)$$

This is actually a limiting case since it is easy to prove that 

$$D(L_d^*) \geq D(c)$$

by contradiction. Assume that the converse is true and that 
there is thus a subset of size $n_1 n_2$ from $c$ that has a greater 
level of diversity (however this is defined) than the subset 
of the same size from $C$. Now both $c$ and $L_d^*$ are subsets 
of $C$, i.e., $c \subseteq C$ and $L_d^* \subseteq C$, and thus every member of $c$ 
must also be in $C$. However, if the subset $L_d^*$ is defined to 
be the maximally-diverse subset that can be generated from 
$C$ then, of necessity, 

$$D(L_d^*) \geq D(c)$$

This contradicts the original assumption, which must thus 
be false. These arguments demonstrate that it is possible 
for a subset-selection procedure to identify a subset that is 
equal in diversity to that of a fully enumerated library but 
that it is never possible (in accordance with intuition) to 
identify one that is superior. In fact, given that $L_d^*$ is selected 
from the fully enumerated library it is very likely that $D(L_d^*) > D(c)$. 


GILLET ET AL. 

Thus far, it has been assumed that the subsets $r_1$, $r_2$, and 
$L_d^*$ are maximally diverse; however, this is unlikely to be 
achieved in practice. Selection of the maximally-diverse 
subset is computationally infeasible since it requires evaluation of 

$$\frac{N!}{n!\,(N-n)!}$$

subsets, where a subset of $n$ compounds is selected from a 
library containing $N$ compounds. Let $D(\max)$ be the 
maximum diversity that is possible for a subset of $C$ of size 
$n_1 n_2$. It is known that $D(c)$ and $D(L_d^*)$ are unlikely to be 
equal to $D(\max)$. It is also possible to find the most similar 
subset of compounds of a collection. Let this subset be $L_s^*$ 
with diversity $D(L_s^*)$. Any algorithm for the selection of 
the most similar subset is also likely to be suboptimal, and 
if $D(\min)$ is the diversity of the maximally-similar subset, then it is likely 
that $D(L_s^*) > D(\min)$. Let $D(L_r^*)$ be the diversity of a subset 
that is selected at random from $C$. The subsets selected from 
$C$ are summarized in Figure 2 and are referred to as libraries 
of compounds; however, they are most unlikely to represent 
combinatorial libraries. Assuming that the library $c$ is of 
greater diversity than a library selected at random, the relative 
ordering of the diversities of the libraries is then expected 
to be 

$$D(\min) < D(L_s^*) < D(L_r^*) < D(c) < D(L_d^*) < D(\max)$$
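To give a feel for why exhaustive evaluation of every subset is infeasible, the count of $n$-member subsets of an $N$-member collection can be computed directly. The following is a minimal sketch; the function names are illustrative and not taken from the paper, and the pool sizes are those used later in the experiments (40 reactants chosen from 400, and 1600 products chosen from 160 000).

```python
import math


def number_of_subsets(N: int, n: int) -> int:
    """Number of distinct n-member subsets of an N-member collection: N! / (n! (N - n)!)."""
    return math.comb(N, n)


def decimal_digits(value: int) -> int:
    """Number of decimal digits in a positive integer."""
    return len(str(value))


if __name__ == "__main__":
    # Reactant-level selection: 40 molecules from a pool of 400.
    reactant_level = number_of_subsets(400, 40)
    print("40 from 400:", decimal_digits(reactant_level), "digits")

    # Product-level selection: 1600 products from the 160 000 in C.
    product_level = number_of_subsets(160_000, 1600)
    print("1600 from 160000:", decimal_digits(product_level), "digits")
```

Even the smaller, reactant-level count exceeds $10^{50}$, which is why heuristic selection procedures are used in place of exhaustive search.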

In the first set of experiments described in this paper, 
$D(L_d^*)$, $D(L_s^*)$, $D(L_r^*)$, and $D(c)$ were measured for three 
different libraries in an attempt to test the diversity hypothesis. 
Given subset-selection procedures that are not guaranteed 
to be optimal, the question at issue when considering 
the diversity hypothesis is whether it is possible to achieve 
(near)-equally diverse subsets, and whether subset-selection 
at both the reactant level and at the product level is 
significantly more effective than selecting a subset at random. 
If $D(\max)$ and $D(\min)$, the upper and lower bounds on 
diversity, respectively, are known, then a more quantitative 
understanding of the difference between $D(L_d^*)$, $D(c)$, and 
$D(L_r^*)$ can be determined. However, $D(\max)$ and $D(\min)$ 
cannot be measured directly, and the second set of experiments 
was designed to provide estimates for these limiting 
values. The final experimental section describes the new 
algorithm we have developed for selecting combinatorial 
libraries from product space so as to maximize diversity. 

Experimental Details. Two procedures are required to 
demonstrate the validity of the diversity hypothesis: a 
method of selecting maximally diverse subsets that can be 
applied to the reactant pools and to the product library C 
and a method of quantifying the inherently qualitative 
concept of "molecular diversity". 

Dissimilarity-based compound selection (DBCS) 10-14 involves 
the identification of the maximally-diverse subset of 
size $n$ from a database of size $N$ (where, typically, $n \ll N$). 
The DBCS algorithm involves summing the dissimilarities 
of every molecule with all of the other molecules in the set. 
The first molecule to be selected is that which is most 
dissimilar from all of the others, i.e., it has the greatest sum 
of dissimilarities to all the other molecules. The second molecule 
to be selected is that which is most dissimilar to the first. 
The third molecule is that which is most dissimilar to the 
Figure 1. Subset selection can be performed at either the reactant 
level or at the product level. The diversity hypothesis assumes 
that diversity at the reactant level reflects diversity at the product 
level. 

Figure 2. Different libraries selected from the fully enumerated 
combinatorial library $C$: $L_d^*$ is the library selected by applying 
DBCS, $L_s^*$ is the library selected by applying SBCS, and $L_r^*$ is a 
library selected at random. 



first and the second and so on. The DBCS method used 
here for the selection of diverse reactants (the pools $r_1$ and 
$r_2$) and diverse products (the library $L_d^*$) is that described 
by Holliday et al. 14 and is based on the cosine coefficient. 
This implementation of the DBCS algorithm can be applied 
to sets of molecules using any structural representation that 
is described by a vector. We have chosen to use Daylight 
fingerprints, 15 containing 1024 bits, for most of the experiments 
reported here although we have also briefly tested the 
hypothesis for another representation that is described later. 
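The selection steps described above can be sketched as follows. This is a naive illustration under stated assumptions, not the implementation of Holliday et al.: "most dissimilar to those already selected" is read here as the greatest summed dissimilarity to the selected set, dissimilarity is taken as one minus the cosine coefficient, and the names `cosine_dissimilarity` and `dbcs_select` are hypothetical. The sketch makes no attempt to reproduce the efficiency of the published algorithm.

```python
import math
from typing import List, Sequence


def cosine_dissimilarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine coefficient of two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Assumption: an all-zero vector is treated as sharing nothing with any other.
    return 1.0 if norm == 0.0 else 1.0 - dot / norm


def dbcs_select(vectors: Sequence[Sequence[float]], n: int) -> List[int]:
    """Greedily pick the indices of n mutually dissimilar vectors."""
    if not 0 < n <= len(vectors):
        raise ValueError("n must be between 1 and the number of vectors")
    candidates = list(range(len(vectors)))

    # First pick: greatest sum of dissimilarities to all other molecules.
    first = max(
        candidates,
        key=lambda i: sum(
            cosine_dissimilarity(vectors[i], vectors[j]) for j in candidates if j != i
        ),
    )
    selected = [first]

    # Running sum of each remaining molecule's dissimilarity to the selected set.
    score = {
        i: cosine_dissimilarity(vectors[i], vectors[first])
        for i in candidates
        if i != first
    }
    while len(selected) < n:
        best = max(score, key=score.get)
        selected.append(best)
        del score[best]
        for i in score:
            score[i] += cosine_dissimilarity(vectors[i], vectors[best])
    return selected


if __name__ == "__main__":
    toy = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
    print(dbcs_select(toy, 3))
```

In the toy input the first two vectors are identical, so a three-member selection keeps only one of them together with the two orthogonal vectors.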

The DBCS algorithm can be modified easily to select the 
most similar subset of molecules from a collection. Thus, a 
similarity-based compound selection (SBCS) algorithm was 
implemented by measuring pairwise similarities, rather than 
dissimilarities, and selecting molecules that are most similar, 
rather than most dissimilar, to those already selected. The 
SBCS algorithm was applied to the fully enumerated library, 
C, to generate the subset of most similar compounds, referred 
to as library L s *. 

In calculating the diversities of the various sets of 
compounds, we have followed previous workers 5,6,9 in 
assuming that the diversity of a set of molecules can be 
determined from the intermolecular structural dissimilarities 
for that dataset. Specifically, we have used the diversity 
measure described by Turner et al., 16 which is the mean 
intermolecular dissimilarity when averaged over all the pairs 
of molecules in a dataset and which provides an easily 
calculable single-valued representation of the diversity of 
molecules in a dataset. 
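The diversity index described above, the mean dissimilarity over all pairs of molecules, can be sketched in a few lines. The use of one minus the cosine coefficient as the pairwise dissimilarity is an assumption made for this illustration, and the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_dissimilarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine coefficient of two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Assumption: an all-zero vector is treated as sharing nothing with any other.
    return 1.0 if norm == 0.0 else 1.0 - dot / norm


def diversity(vectors: Sequence[Sequence[float]]) -> float:
    """Mean pairwise dissimilarity over all distinct pairs in the set."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("at least two molecules are needed")
    return sum(cosine_dissimilarity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Two identical vectors and one orthogonal vector: pairs score 0, 1, and 1.
    print(diversity([[1, 0], [1, 0], [0, 1]]))
```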

The DBCS algorithm was first used to identify the most 
dissimilar $n_1 n_2$ molecules from amongst the $N_1 N_2$ molecules 
in $C$; the selected molecules form the subset library $L_d^*$. The 
algorithm was then used to create the reactant pools $r_1$ and 
$r_2$ by identifying the $n_1$ most dissimilar molecules from 
amongst the $N_1$ molecules in $R_1$ and the $n_2$ most dissimilar 
molecules from amongst the $N_2$ molecules in $R_2$; the 
combination of these two reactant pools forms the combinatorial 
library $c$. The diversities of $L_d^*$ and $c$ were then 
calculated and compared, as detailed in Figure 3. Random 
subsets, $L_r^*$, and subsets of similar compounds, $L_s^*$, were 
also selected for comparison by analogous procedures. 

The selection of a set of $n$ molecules from a database of 
$N$ molecules using our DBCS algorithm has a time complexity 
of order $O(nN)$. 14 The creation of the library $L_d^*$ requires 
the selection of $n_1 n_2$ diverse molecules from the $N_1 N_2$ 
molecules in $C$ (step 2 in Figure 3) and hence has a time 
complexity of $O(n_1 n_2 N_1 N_2)$. Similarly, the creation of the 
combinatorial library $c$ requires the generation of the two 
diverse reactant pools, $r_1$ and $r_2$ (steps 4 and 5 in Figure 3), 
and hence has a total time complexity of order $O(n_1 N_1)$ + 
$O(n_2 N_2)$. The remaining steps in Figure 3 have complexities 
of $O(N_1 N_2)$ for step 1, $O(n_1 n_2)$ for steps 3, 6, and 7, and 
$O(1)$ for step 8. Step 2 hence dominates the computation, 
and the procedure thus has an overall complexity of 
$O(n_1 n_2 N_1 N_2)$. In most of the experiments reported below, 
both $n_1$ and $n_2$ were 40 and both $N_1$ and $N_2$ were 400, with 
the selection of 1600 compounds from 160 000 taking approximately 
2.6 h on a Silicon Graphics R10000 workstation. 
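The gap between the two selection strategies can be made concrete by evaluating the dominant terms for the pool sizes quoted above. This is a back-of-envelope sketch only: it counts the leading $nN$ term of each selection and ignores constant factors, and `selection_cost` is an illustrative name.

```python
def selection_cost(n: int, N: int) -> int:
    """Dominant term n * N for selecting n molecules from N with an O(nN) procedure."""
    return n * N


n1 = n2 = 40     # reactants selected from each pool
N1 = N2 = 400    # reactants available in each pool

# Product-level selection (step 2): 1600 products chosen from 160 000.
product_level = selection_cost(n1 * n2, N1 * N2)

# Reactant-level selection (steps 4 and 5): 40 chosen from 400, twice.
reactant_level = selection_cost(n1, N1) + selection_cost(n2, N2)

print(product_level)                    # 256000000
print(reactant_level)                   # 32000
print(product_level // reactant_level)  # 8000
```

Under these assumptions, product-level selection costs roughly 8000 times as many dissimilarity evaluations as selecting the two reactant pools separately.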

The experiments involved three different combinatorial 
library systems: an amide library that was built by coupling 
carboxylic acids to primary amines by forming a peptide 
bond; two related libraries based on benzoic acid as a 
template; and a library based on Kemp's acid. It should be 
noted here that the experiments are "paper" experiments 
designed to test the diversity hypothesis and they are not 
real synthetic experiments. The libraries varied from having 
no common substructural core, in the first example, through 
having a small core substructure in the benzoic acid 
examples, to having a large core substructure where the 
substituents are generally relatively small compared to the 
core itself, in the Kemp's acid example. In each case, the 
Daylight Toolkit 16 was used to develop software to perform 
the required "reaction" between reactants in different pools 
in order to enumerate the libraries. In all of the experiments, 
$C$ contained 160 000 products and the sizes of $L_d^*$, $L_s^*$, $L_r^*$, 
and $c$ were varied. 

RESULTS 

Amide Library. The amide combinatorial library, $C$, was 
enumerated by coupling carboxylic acids to primary amines 
through the formation of a peptide bond; see Figure 4. A 
pool of amine reactants was formed by choosing a random 
set of 400 molecules from the primary amines that are present 
in the World Drug Index (WDI). 17 Similarly, a pool of 
carboxylic acid reactants was formed by choosing from the 
same file a random set of 400 molecules that contain a single 
carboxylic acid group. Software was developed to join 
combinations of amines and carboxylic acids by forming 
peptide bonds. 

The full library, $C$, of size 160 000, was constructed by 
joining all 400 amines to all 400 carboxylic acids. In the 
first experiments, a diverse subset, $L_d^*$, containing 1600 
molecules was selected from $C$ using the DBCS algorithm 14 
1. Create the $N_1 N_2$ products in library $C$ by combining each of the $N_1$ reactants in $R_1$ with 
each of the $N_2$ reactants in $R_2$. 

2. Create the library $L_d^*$ by selecting the $n_1 n_2$ most diverse products from $C$. 

3. Calculate the sums of dissimilarities for all pairs of products in $L_d^*$, and hence the 
diversity $D(L_d^*)$. 

4. Create $r_1$ by selecting the $n_1$ most diverse reactants from $R_1$. 

5. Create $r_2$ by selecting the $n_2$ most diverse reactants from $R_2$. 

6. Create the $n_1 n_2$ products in library $c$ by combining each of the $n_1$ reactants in $r_1$ with 
each of the $n_2$ reactants in $r_2$. 

7. Calculate the sums of dissimilarities for all pairs of products in $c$, and hence the 
diversity $D(c)$. 

8. Compare $D(L_d^*)$ with $D(c)$. 

Figure 3. Procedure for generating libraries $c$ and $L_d^*$ that can then be used to test the diversity hypothesis. 
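The eight steps of Figure 3 can be traced on toy data. The sketch below rests on several assumptions that are not part of the paper: each reactant is represented by a random set of "on" bit positions, a product's fingerprint is taken to be the union of its two reactants' bits (real fingerprints would be computed from the enumerated product structure), and the greedy selection is the naive summed-dissimilarity reading of DBCS. All function names are hypothetical. Because the selection is heuristic, a toy run is not guaranteed to reproduce the ordering reported in the paper.

```python
import math
import random
from itertools import combinations, product
from typing import FrozenSet, List, Sequence, Tuple

Fingerprint = FrozenSet[int]


def cos_dissim(a: Fingerprint, b: Fingerprint) -> float:
    """1 - cosine coefficient for binary fingerprints stored as sets of on-bits."""
    if not a or not b:
        return 1.0
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))


def dbcs(fps: Sequence[Fingerprint], n: int) -> List[Fingerprint]:
    """Greedy dissimilarity-based selection of n fingerprints (naive sketch)."""
    idx = range(len(fps))
    first = max(idx, key=lambda i: sum(cos_dissim(fps[i], fps[j]) for j in idx))
    selected = [first]
    score = {i: cos_dissim(fps[i], fps[first]) for i in idx if i != first}
    while len(selected) < n:
        best = max(score, key=score.get)
        selected.append(best)
        del score[best]
        for i in score:
            score[i] += cos_dissim(fps[i], fps[best])
    return [fps[i] for i in selected]


def diversity(fps: Sequence[Fingerprint]) -> float:
    """Mean pairwise dissimilarity."""
    pairs = list(combinations(fps, 2))
    return sum(cos_dissim(a, b) for a, b in pairs) / len(pairs)


def compare_selection_levels(
    N1: int = 12, N2: int = 12, n1: int = 3, n2: int = 3, seed: int = 0
) -> Tuple[float, float]:
    rng = random.Random(seed)
    # Toy reactant pools R1 and R2 (assumed, random bit sets).
    R1 = [frozenset(rng.sample(range(0, 32), 6)) for _ in range(N1)]
    R2 = [frozenset(rng.sample(range(32, 64), 6)) for _ in range(N2)]

    C = [a | b for a, b in product(R1, R2)]      # step 1: enumerate C
    L_d = dbcs(C, n1 * n2)                        # step 2: product-level selection
    D_Ld = diversity(L_d)                         # step 3
    r1 = dbcs(R1, n1)                             # step 4
    r2 = dbcs(R2, n2)                             # step 5
    c = [a | b for a, b in product(r1, r2)]      # step 6: enumerate c
    D_c = diversity(c)                            # step 7
    return D_Ld, D_c                              # step 8: compare


if __name__ == "__main__":
    D_Ld, D_c = compare_selection_levels()
    print(f"D(L_d*) = {D_Ld:.3f}, D(c) = {D_c:.3f}")
```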
Table 1. Test of the Diversity Hypothesis Using an Amide Library* 

$D(L_d^*)$    $D(c)$    $D(L_r^*)$        $D(L_s^*)$ 
0.652         0.596     0.508 (0.003)     0.132 
0.651         0.589     0.509 (0.004)     0.134 
0.651         0.594     0.510 (0.004)     0.132 

* $D(L_r^*)$ gives the mean diversity and standard deviation (in brackets) 
for 1000 subsets chosen at random. 

Table 2. Effect of Library Size, $n(c)$, on Diversity Using an Amide 
Library 


and Daylight fingerprint representations of the molecules. 
The same procedure and structural representation was then 
used to select 40 diverse amines and 40 diverse acids from 
the two pools of 400 reactants; the selected acids and amines 
were used to generate the combinatorial library, c, of 1600 
amides. The diversity hypothesis was then tested by 
comparing the D(c) and D(L d *) values using the procedure 
shown in Figure 3. $D(L_s^*)$ and $D(L_r^*)$ were also measured. 
The entire procedure was repeated three times, using different 
randomly-chosen pools of amine and carboxylic acid reac- 
tants. 

The results are shown in Table 1, where it can be seen 
that the relative ordering of the libraries is as expected: 
$D(L_s^*) < D(L_r^*) < D(c) < D(L_d^*)$. A more diverse library 
of compounds is thus generated by performing compound 
selection at the product level rather than at the reactant level, 
and selection at both the product and reactant levels results 
in more diverse libraries than selection of a subset of 
compounds at random. The distribution of diversities is not 
symmetrical; that is, the diversity of a library selected at 
random is closer to the diversity of the library selected using 
the DBCS algorithm than to the diversity of the library 
selected using the SBCS algorithm. This is because of the 
nature of a combinatorial library, where a small number of 
reactants is used to generate a large number of products. 
There are many subsets of very similar compounds; for 
example, a subset where the carboxylic acid component is 
constant and variation is seen in the amine component only. 



$n(c)$    $D(L_d^*)$    $D(c)$    $D(L_r^*)$        $D(L_s^*)$ 
1600      0.652         0.596     0.509 (0.003)     0.132 
900       0.658         0.594     0.509 (0.006)     0.125 
400       0.663         0.596     0.509 (0.015)     0.127 
100       0.674         0.582     0.513 (0.059)     0.100 




Further runs were carried out for different sizes of subsets; 
specifically, libraries of 1600, 900, 400, and 100 compounds 
were selected that correspond to reactant subset sizes of 40, 
30, 20, and 10, respectively. The results of these runs are 
shown for an amide library in Table 2, where it will be seen 
that the $D(L_d^*)$ values increase as the size of the library 
decreases. This is because an algorithm for DBCS will 
initially tend to select molecules from "around the edges" 
and then move toward the "center" of the dataset once its 
periphery has been fully explored. The first molecules 
selected, which will be those in a very small subset, will 
hence tend to be more dissimilar to each other than those 
selected later, and since diversity is measured as the mean 
intermolecular dissimilarity averaged over all molecules in 
the set, the diversity increases as the library size decreases, 
with the resulting trend that is shown in Table 2. The mean 
diversities of the randomly chosen subsets do not vary 
much, but the range of values increases as the subsets 
decrease in size. Less variation is seen in the $D(c)$ values 
than the $D(L_d^*)$ values, with the 1600, 900, and 400 subsets 
having very similar diversities. However, the smallest 
combinatorial library of 100 molecules shows an unexpected 
decrease in diversity, to a value that is not statistically significantly 
different from the diversity of subsets chosen at random. In 
selecting the most diverse reactants no account is taken of 
the diversity of one reactant subset relative to the other and 
it could be that, although diversity within a subset is 
maximized, the subset as a whole contains molecules that 
are similar to those found in the other subset. This 
Table 3. Effect on $D(c)$ When Reactants Are Chosen To Maximize 
the Diversity of Both Reactant Pools Taken Together 

$n(c)$    $D(c)$ 
1600      0.599 
900       0.602 
400       0.607 
100       0.607 



Table 4. Test of the Diversity Hypothesis Using an Amide Library* 

$D(L_d^*)$    $D(c)$    $D(L_r^*)$ 
0.449         0.346     0.133 (0.003) 

* Dissimilarity comparisons are made using structural features. $L_d^*$ 
contains the 1600 compounds that result from performing DBCS on 
the product library $C$, and $c$ contains 1600 compounds that result from 
using DBCS to select 40 diverse reactants from each pool and then 
enumerating the products. $D(L_r^*)$ gives the mean diversity and standard 
deviation (in brackets) for 1000 subsets of size 1600 chosen at random. 

assumption was tested by altering the way in which reactants 
are selected. Table 3 shows the results obtained when 
reactants are selected alternately and the diversity of the two 
subsets together is maximized. In this case, the $D(c)$ values 
also increase as the size of the subsets decreases. 

Table 4 shows the results obtained when a different kind 
of representation was used to measure the dissimilarity 
between molecules. In this case, the molecules were 
represented by a number of structural features that might 
better relate to biological activity. Each structure is represented 
by the following six different features: counts of 
hydrogen bond donors, hydrogen bond acceptors, rotatable 
bonds, and aromatic rings and the physical property values 
of molecular weight and the $^2\kappa_\alpha$ shape descriptor. A 
hydrogen bond donor is defined as any heteroatom that 
carries at least one hydrogen. A hydrogen bond acceptor is 
defined as a heteroatom, excluding the halogens, aromatic 
oxygen, sulfur, and pyrrole nitrogen and the higher oxidation 
levels of nitrogen, phosphorus, and sulfur. (All of the 
compounds used in the experiments have neutral charge.) 
The feature values were standardized to fit into the range 
0 to 1. This is achieved by finding the maximum and 
minimum values of each feature over the whole library of molecules. The 
normalized value for each feature, $x$, in each molecule is 
then given by 

$$\frac{x - \min}{\max - \min}$$

$D(c)$ and $D(L_d^*)$ were calculated using the DBCS algorithm 
as before. These results again show that a more diverse 
library results if selection is performed at the product level 
rather than at the reactant level, demonstrating that our results 
are not dependent on the particular structural representation 
that is used. 
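The range scaling of the feature values described above amounts to ordinary min-max normalization applied to each feature in turn. A minimal sketch follows; the function name is illustrative, and the handling of a feature that takes the same value for every molecule is an assumption, since the paper does not discuss that case.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale one feature's values over the whole library into the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information and maps to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    molecular_weights = [100.0, 300.0, 500.0]  # hypothetical values
    print(min_max_normalize(molecular_weights))  # [0.0, 0.5, 1.0]
```

Scaling each feature separately keeps a large-valued property such as molecular weight from dominating small integer counts such as the number of aromatic rings.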

3-Amino-5-hydroxybenzoic Acid Library. Dankwardt 
et al. 18 have discussed the generation of combinatorial 
libraries in which the core molecule is 3-amino-5-hydroxybenzoic 
acid. In one of their experimental schemes, the 
carboxylic acid serves as a handle onto which the solid 
support is attached and diversity is incorporated onto the two 
remaining functional groups. Two different reactions were 
simulated in the present study, both involving the substitution 
of carboxylic acids onto the amine group. 

In the first set of experiments the hydroxyl group was 
substituted by sulfonyl chlorides, as shown in Figure 5a. In 
Figure 4. The amide library. 

Figure 5. (a) The benzoic acid library with carboxylic acids and 
sulfonyl chlorides as substituents. (b) The benzoic acid library with 
carboxylic acids and chlorides as substituents. 

Table 5. Test of the Diversity Hypothesis Using a 
3-Amino-5-hydroxybenzoic Acid Library 

$n(c)$    $D(L_d^*)$    $D(c)$    $D(L_r^*)$        $D(L_s^*)$ 

(a) Use of Sulfonyl Chlorides 
1600      0.408         0.388     0.358 (0.002)     0.130 
900       0.410         0.387     0.359 (0.002)     0.117 
400       0.413         0.384     0.360 (0.004)     0.105 
100       0.421         0.373     0.365 (0.020)     0.093 

(b) Use of Chlorides 
1600      0.435         0.411     0.390 (0.001)     0.167 
900       0.437         0.410     0.391 (0.001)     0.143 
400       0.440         0.409     0.391 (0.003)     0.114 
100       0.448         0.404     0.396 (0.009)     0.086 




the second set of experiments, the hydroxyl group was 
substituted by chloride-containing substituents (alkyl halides 
in the original paper but modified here to be any compound 
containing a single chlorine atom), as shown in Figure 5b. 
Reactant pools were created by selecting 400 molecules at 
random from the SPRESI database. 19 One pool consisted 
of molecules containing a single carboxylic acid group; 
another consisted of molecules containing a sulfonyl chloride 
group; and a third pool consisted of molecules containing a 
single chlorine atom. Libraries containing 160 000 molecules 
were enumerated from the reactant pools, and 1600-member 
subsets $c$, $L_d^*$, $L_r^*$, and $L_s^*$ were then generated as described 
previously. 

The $D(L_d^*)$, $D(c)$, $D(L_r^*)$, and $D(L_s^*)$ values for these 
datasets are listed in Table 5 (parts a and b), where the same 
behavior is observed as was the case with the amide libraries; 
that is, reactant-based selection lies between random selection 
and product-based selection. Note that all of the diversity 
values here are lower than for the amides, presumably due 
to the common core substructure that is present in all these 
products, and similar comments apply to the Kemp's library 
discussed below. 

Kemp's Acid Library. Kocis et al. 20 have demonstrated 
the use of Kemp's acid as a useful building block for the 
preparation of synthetic receptors. The core molecule used 
is shown in Figure 6. Two pools of reactants were used at 
substitution positions $R_1$ and $R_2$, with each of these reactant 
pools containing 400 carboxylic acids chosen at random from 
carboxylic acids extracted from the SPRESI database. The 
libraries $C$, $c$, $L_d^*$, $L_r^*$, and $L_s^*$ were generated as described 
previously, with the results shown in Table 6 being comparable 
to those obtained with the amide and 3-amino-5-hydroxybenzoic 
acid libraries. 

Figure 6. The Kemp's acid library. 

Table 6. Test of the Diversity Hypothesis Using a Library Derived 
from Kemp's Acid 

$n(c)$    $D(L_d^*)$    $D(c)$    $D(L_r^*)$        $D(L_s^*)$ 
1600      0.435         0.402     0.380 (0.001)     0.169 
900       0.437         0.401     0.380 (0.001)     0.145 
400       0.440         0.403     0.381 (0.003)     0.116 
100       0.448         0.396     0.386 (0.012)     0.106 

ESTIMATING THE UPPER BOUND OF DIVERSITY 

The results obtained above suggest that selection at the 
reactant level is less effective than selection at the product 
level, implying that the diversity hypothesis is not 
correct. However, it is difficult to quantify the differences 
in diversities without knowing the bounds on diversity for 
subsets selected from a library. This section describes 
experiments that attempted to determine $D(\max)$ in order to 
evaluate the effectiveness of reactant-based selection compared 
to product-based selection. 

Three different methods were used in order to estimate 
how near-optimal the subset selection method is, i.e., how 
close $D(L_d^*)$ is to $D(\max)$. These were as follows: measuring 
the variation in the diversities of subsets of a given size 
chosen at random; comparing the DBCS result with a genetic 
algorithm that was designed to maximize the diversity of 
subsets; and using extreme value methods to estimate the 
end point of the distribution of diversities. The experiments 
were applied to the problem of selecting 40 diverse molecules 
from a set of 400 carboxylic acids extracted at random from 
the WDI database. The diversities that result in each 
experiment were compared with the diversity produced by 
the DBCS method, which gave a value of 0.698, i.e., $D(L_d^*)$ 
= 0.698. 

Standard Deviation. The performance of the DBCS 
algorithm was estimated by generating many subsets at 
random in an empirical test of goodness of the heuristic. 21 
Three million subsets of size 40 were selected at random 
and their diversities measured using the Daylight fingerprint 
representation of molecular structure. The mean diversity 
was calculated as 0.618 with a standard deviation of 0.016. 
The subset selected by the DBCS method has a diversity of 
0.698, which is 5.0 standard deviations above the mean. If it 
is assumed that the subset diversities are normally distributed, 
then the DBCS-selected subset is superior to no less than 
99.999 999 999 9% of all possible subsets. While this result 
says nothing about the difference between $D(\max)$ and 
$D(L_d^*)$, it does demonstrate that the DBCS algorithm is able 
to generate subsets with very high diversities using our 
chosen diversity index. 
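The arithmetic behind the "standard deviations above the mean" figure, and the random-sampling experiment that produces the mean and standard deviation, can be sketched as follows. The helper names are hypothetical, and `diversity_fn` stands for whatever diversity index is in use; the sampling routine is an illustration of the idea rather than the authors' code.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def z_score(value: float, mean: float, std: float) -> float:
    """Number of standard deviations by which value lies above the mean."""
    return (value - mean) / std


def random_subset_stats(
    diversity_fn: Callable[[List[T]], float],
    items: Sequence[T],
    subset_size: int,
    trials: int,
    seed: int = 0,
) -> Tuple[float, float]:
    """Mean and sample standard deviation of diversity over random subsets."""
    rng = random.Random(seed)
    scores = [diversity_fn(rng.sample(items, subset_size)) for _ in range(trials)]
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    # Values quoted in the text: DBCS diversity, random-subset mean and deviation.
    print(round(z_score(0.698, 0.618, 0.016), 3))  # 5.0
```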
Genetic Algorithm. A genetic algorithm (GA) 22 was used 
to explore the diversity space of different subsets generated 
from the library. A GA is the computational analogue of 
Darwinian evolution. Potential solutions to a problem are 
encoded in a population of chromosomes which are linear 
representations of the problem that is to be solved and which 
are scored using a fitness function. New populations are 
developed by performing genetic-like operations of mutation 
and crossover on some members of the existing population. 
Higher scoring individuals have a higher probability of 
passing their genes into the new populations. The new 
chromosomes are scored and the GA iterates, usually until 
it has converged on a solution. GAs have previously been 
applied to many problems in computational chemistry, 23 
including the design of combinatorial libraries targeted at 
one particular biological assay. 5 Here, the GA was devel- 
oped to generate diverse subsets of molecules. 

Each chromosome in the GA represented a particular 40- 
member subset with each element of a chromosome repre- 
senting a molecule in the library. Thus, a chromosome 
contained 40 integers in the range 1-400 corresponding to 
the 400 carboxylic acids that are available for selection. The 
GA was initialized with a population of 50 chromosomes 
that represented 50 different randomly chosen subsets. The 
genetic operators crossover and mutation were applied to 
evolve new subsets, ensuring that 40 unique molecules were 
present in each of the subsets. Mutation involved changing 
an element to a new randomly chosen element that repre- 
sented a molecule that was not already contained in the 
subset. One-point crossover was modified so that an element 
was only exchanged if the molecule represented by the new 
element did not already exist in the chromosome. The 
chromosomes were scored by calculating the diversity of the 
subsets of molecules they represented, so that the fittest 
chromosomes were those that represented the most diverse 
subsets. The fitness function therefore attempts to maximize 
the diversity of the subsets represented by the chromosomes. 
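
The two uniqueness-preserving operators described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, molecules are indexed from 0 rather than 1, the dissimilarity matrix `dissim` is assumed to be precomputed, and diversity is taken here as the mean pairwise dissimilarity (the normalisation is an assumption).

```python
import random
from itertools import combinations


def diversity(subset, dissim):
    """Mean pairwise dissimilarity of the molecules indexed by `subset`."""
    pairs = list(combinations(subset, 2))
    return sum(dissim[i][j] for i, j in pairs) / len(pairs)


def mutate(chrom, pool_size, rng):
    """Replace one element with a molecule not already in the subset."""
    child = list(chrom)
    present = set(child)
    candidates = [m for m in range(pool_size) if m not in present]
    child[rng.randrange(len(child))] = rng.choice(candidates)
    return child


def crossover(parent_a, parent_b, rng):
    """One-point crossover that only accepts genes new to the child."""
    point = rng.randrange(1, len(parent_a))
    child = list(parent_a)
    for pos in range(point, len(child)):
        gene = parent_b[pos]
        # Exchange the element only if it would not create a duplicate.
        if gene not in child:
            child[pos] = gene
    return child
```

Because each exchange removes one element and adds one that is absent from the child, every chromosome produced by these operators still contains the required number of unique molecules.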

Many experiments were carried out to identify the
best parameter values for the GA. Mutation was applied at
a high rate relative to crossover (3:1) in order to prevent
the GA from converging on a local, rather than a global,
maximum. The GA iterated until it had converged on a
solution, i.e., until it could not find a better subset. Once 
this condition had been reached, the GA was rerun with a 
new starting population consisting of the best 10% of the 
chromosomes from the previous run, with the remaining 
chromosomes initialized to random subsets. After 100 runs
of the GA, the best result obtained was a subset with diversity
exactly equal to that found by the DBCS method (0.698).
Thus, the GA was unable to find a more diverse subset than is
found using the DBCS subset-selection algorithm, despite
searching through the diversity space of a large number of 
subsets selected from the library. This suggests that the 
greedy algorithm approach embodied in DBCS provides an 
extremely effective heuristic for selecting diverse subsets. 
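
The restart scheme described above (carry over the best 10% of chromosomes, fill the rest with random subsets) might be sketched as below. The function name and arguments are hypothetical, and molecules are again indexed from 0.

```python
import random


def reseed_population(population, scores, pool_size, subset_size, rng,
                      keep_fraction=0.10):
    """Build the starting population for the next GA run.

    The best `keep_fraction` of chromosomes (by score) are retained and
    the remainder are replaced by freshly drawn random subsets.
    """
    n_keep = max(1, int(len(population) * keep_fraction))
    # Sort on the score only, so tied scores never compare chromosomes.
    ranked = sorted(zip(scores, population), key=lambda pair: pair[0],
                    reverse=True)
    elite = [list(chrom) for _, chrom in ranked[:n_keep]]
    fresh = [rng.sample(range(pool_size), subset_size)
             for _ in range(len(population) - n_keep)]
    return elite + fresh
```

With a population of 50, this keeps five chromosomes and draws 45 new random 40-member subsets for each rerun.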

Extreme Value Methods. Extreme value methods have 
been used as ways of estimating the extreme behavior of a 
process on the basis of an observed, independent distribu- 
tion. They have been applied in engineering situations 
where structures such as oil-rigs typically fail, e.g., overturn 
because of the occurrence of extreme values of a single 
environmental process or a critical extreme combination of 
constituent variables, such as sea surface waves and winds.
The essential problem is one of extrapolation: to estimate
some unknown distribution function beyond the end of the 
known observations or, in other words, to estimate the end 
point of a distribution in a finite population. Extreme value 
methods can be applied to the problem of estimating the 
maximum diversity of a subset of a given size extracted from 
a combinatorial library. A finite distribution of diversities 
exists for all possible subsets of molecules of a given size 
selected from a library. The known observations that have 
already been measured for such a distribution include D(L_d*),
D(L_r*), and D(L*), where L_d* is the subset with the largest
calculated diversity. It is possible to generate any number 
of observations, for example, by selecting subsets at random 
and measuring their diversities as done previously and then 
using the observed distribution to predict the end point of 
the distribution of all possible subsets. 

The extreme value method applied here was the Genera- 
lised Pareto Distribution (GPD) 25 which has been widely used 
to model the upper tails of distributions. The GPD method 
was applied to subset selection in order to estimate the 
diversity of the maximally-diverse subset. In general, 
observations with the largest values are expected to provide 
the most valuable information on the end point in the GPD 
method, and, hence, a number of independent observations 
that are close to the best observation, D(L_d*), are required.
Data points close to D(L_d*) were generated by seeding a
subset with five randomly chosen structures. The remaining
35 structures in the 40-member subset were chosen by 
applying the DBCS algorithm to the full set of 400 carboxylic 
acids. This process was repeated 100 times to give 100 data 
points. None of these subsets was more diverse than the 
DBCS selected subset. The DBCS observation was also 
included in the sample. The GPD method involves sampling 
values in the distribution above a threshold. The best 
estimate for the maximum diversity using this method was 
0.699 with a standard deviation of 0.0002, which is almost 
identical to the best result found using the GA. Even 
allowing for five standard deviations above the mean, this 
still gives a value for D(max) of only 0.700, thus again
suggesting that D(L_d*) is a fair approximation to D(max).
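
As a rough illustration of how a GPD fit yields an end-point estimate, the sketch below fits the excesses over a threshold by the method of moments. The text does not say how the GPD parameters were fitted, so the fitting method, the function name, and the choice of threshold are all assumptions made for illustration.

```python
from statistics import mean, variance


def gpd_endpoint(values, threshold):
    """Estimate the upper end point of a distribution from a GPD fit.

    Excesses over `threshold` are fitted by the method of moments.
    For a GPD with shape xi and scale sigma:
        mean = sigma / (1 - xi)
        var  = sigma**2 / ((1 - xi)**2 * (1 - 2 * xi))
    so mean**2 / var = 1 - 2 * xi. A finite end point exists only
    when xi < 0, in which case it equals threshold - sigma / xi.
    """
    excesses = [v - threshold for v in values if v > threshold]
    if len(excesses) < 2:
        raise ValueError("need at least two values above the threshold")
    m = mean(excesses)
    var = variance(excesses)
    if var == 0:
        raise ValueError("excesses have zero variance")
    ratio = m * m / var
    shape = 0.5 * (1.0 - ratio)
    scale = 0.5 * m * (ratio + 1.0)
    if shape >= 0:
        return float("inf")  # fitted tail is unbounded
    return threshold - scale / shape
```

In the setting above, `values` would be the diversities of the seeded subsets together with the DBCS observation, and the returned value plays the role of the estimate of D(max).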

SELECTING COMBINATORIAL LIBRARIES FROM 
PRODUCTS 

The experiments in the previous section suggest that the 
DBCS algorithm is very effective at finding a subset that is 
very close in diversity to the maximally diverse subset, i.e., 
D(L_d*) ≈ D(max). For all of the libraries tested above, the
mean diversity, D(L_r*), and standard deviation for 1000
randomly chosen subsets of C of size 1600 represent an
approximate lower bound to the effectiveness of reactant-
based selection procedures for these classes of structures. It
is clear that the D(L*) values are a large percentage of both 
the D(c) and D(L_d*) values. It is certainly the case that the
D(c) values are closer to the D(L_d*) values than they are to
the D(L*) values; however, it is also clear that there is still
considerable scope for improving the diversity of the products 
and that the correctness of the diversity hypothesis is not 
supported. This suggests that it would be beneficial to 
develop selection methods that operate in product space. 
However, although greater diversity can be achieved by
applying DBCS at the product level, this technique is
synthetically inefficient since the subsets of molecules do
not represent combinatorial libraries, and thus it is of limited
use in practical combinatorial chemistry.

Figure 7. (a) A dimer library can be represented by a two-dimensional
matrix where the rows of the matrix represent the reactants in one pool
and the columns represent the reactants in the other pool. The
elements of the matrix represent dimers. The shaded elements
represent an example of a subset library, L_d*, of the nine most diverse
compounds chosen by applying DBCS to the enumerated library
C. The synthetic inefficiency of the method is highlighted by the
number of reactants that are required to build the compounds, i.e.,
six reactants are required from pool R_1 and seven from pool R_2.
(b) An n_1 x n_2 subset of the library that is also a combinatorial
library can be selected by intersecting n_1 rows with n_2 columns;
for example, the 3 x 3 library built from reactants x_3, x_6, and x_8
reacted with reactants y_2, y_4, and y_8 is represented by the shaded
elements of the matrix. (c) Reordering of the rows and columns of the
matrix results in the combinatorial library occupying the top left-hand
corner of the matrix. Selecting a maximally diverse combinatorial
library is then equivalent to reordering the rows and columns of the
matrix in order to maximize the diversity of the molecules in the top
left-hand corner of the matrix.

The synthetic inefficiency resulting from performing
DBCS at the product level is illustrated in Figure 7a. A
fully enumerated combinatorial library, C, built from two
reactant pools can be visualized as a two-dimensional matrix.
The rows of the matrix represent the N_1 reactants available
in pool R_1, and the columns of the matrix represent the N_2
reactants in pool R_2. The elements of the matrix then
represent the full combinatorial library (C), of size N_1 N_2,
that would result from reacting all the reactants in R_1 with
all the reactants in R_2. In Figure 7a, pool R_1 contains the
nine reactants labeled x_1...x_9 and pool R_2 contains the nine
reactants y_1...y_9. Assume that we wish to select the nine most
diverse compounds from C. The DBCS algorithm can select
compounds from anywhere within the matrix; for example,
the library, L_d*, selected using DBCS might correspond to
the shaded elements, as shown. The nine compounds
illustrated require six reactants from pool R_1 and seven
reactants from pool R_2, rather than three from each pool as
would be required to build a nine-member subset that is a
combinatorial library.
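
The matrix picture makes it easy to check how many reactants a selected set of products requires, and whether the set is itself a combinatorial library. The helpers below are a hypothetical sketch in which each product is a (row, column) pair of reactant indices; the cell positions in the usage example are invented and are not those of Figure 7.

```python
def reactants_required(cells):
    """Return (rows used, columns used) for a set of selected products."""
    rows = {r for r, _ in cells}
    cols = {c for _, c in cells}
    return len(rows), len(cols)


def is_combinatorial(cells):
    """True if the cells form a full rows x columns block of the matrix."""
    n_rows, n_cols = reactants_required(cells)
    return len(set(cells)) == n_rows * n_cols
```

For example, the nine cells `[(r, c) for r in (3, 6, 8) for c in (2, 4, 5)]` need three reactants from each pool and form a combinatorial library, whereas nine cells scattered along the diagonal need nine reactants from each pool and do not.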

In the experiments performed using the amide library, the
1600 molecules selected from the full amide library are
constructed from 137 amines and 146 carboxylic acids. The
systematic joining of all these amines to all of the carboxylic
acids, as performed in practical combinatorial synthesis, would
result in 19 992 molecules, of which the 1600 most diverse
molecules are a subset. The synthetic inefficiency of
performing selection at the product level has also been noted
by Cribbs et al.; 26 in their work, nearly all of the reactants
were required in order to build the molecules selected. In
this section we investigate whether it is possible to generate
a combinatorial library from the products that is more diverse
than the library generated by selecting at the reactant level.

A nine-member subset of the dimer library, C, that
represents a combinatorial library can be selected by
intersecting three rows of the matrix with three columns.
For example, a 3 x 3 library built from reactants x_3, x_6, and
x_8 reacted with reactants y_2, y_4, and y_8 is shown by the shaded
elements of the matrix in Figure 7b. For ease of visualization,
assume that the rows and columns of the matrix are
reordered so that the shaded elements occupy the top left-hand
corner of the matrix, Figure 7c, and that the diversity
of this sublibrary is then measured. It is possible to reorder
the rows and columns so that all possible combinatorial
sublibraries can be positioned in the top left-hand corner.
Thus, selecting a combinatorial library from product space
can be visualized as the reordering of an n-dimensional
matrix, where there are n reactant pools involved in the
reaction. Finding a maximally diverse combinatorial library
is then equivalent to reordering the rows and columns of
the matrix and measuring the diversity of all possible
sublibraries that occupy the n_1 n_2-member, top left-hand
corner of the matrix. Exploring all permutations of rows
and columns represents an enormous search space even for
libraries of moderate size, and hence, in practice the
manipulation of the matrix is achieved using a genetic
algorithm. Simulated annealing has also been applied to the
problem of row-column matrix manipulation in a different
context. 27
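
To give a sense of the size of this search space: because the order of the selected rows and columns does not matter, the number of distinct combinatorial sublibraries is a product of binomial coefficients, one per reactant pool. The short sketch below (function name hypothetical) counts them.

```python
from math import comb


def count_sublibraries(pool_sizes, subset_sizes):
    """Number of distinct combinatorial sublibraries.

    One subset of size n_i is chosen from each pool of size N_i, so the
    count is the product of the binomial coefficients C(N_i, n_i).
    """
    total = 1
    for n_pool, n_pick in zip(pool_sizes, subset_sizes):
        total *= comb(n_pool, n_pick)
    return total


# The 9 x 9 example of Figure 7 with 3 x 3 sublibraries:
print(count_sublibraries([9, 9], [3, 3]))  # 84 * 84 = 7056
```

Calling `count_sublibraries([400, 400], [40, 40])` gives the corresponding count for selecting 40 x 40 sublibraries from a 400 x 400 library, which is far too large to enumerate.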

The GA is similar to that described in the previous section
that was designed to select a maximally diverse subset of
reactants, although in this case each chromosome of the GA
represents one combinatorial library. For a dimer library of
size n_1 n_2, a chromosome consists of two parts; the first part
represents the n_1 reactants selected from pool R_1 (or the rows
of the matrix) and the second part consists of the n_2 reactants
selected from pool R_2 (or the columns of the matrix). The
fitness function of the GA is applied to each chromosome
and involves constructing the n_1 n_2 combinatorial library, C*,
represented by the chromosome and measuring its diversity,
D(C*). Diversity is measured by summing the pairwise
dissimilarities as described previously, and the GA attempts
to maximize the diversity, D(C*). The GA uses a population
of 100 chromosomes. One of the chromosomes in the initial
population is initialized to the reactant subsets found by
performing DBCS on the reactant pools themselves and thus
represents a solution with diversity D(c). The remaining
chromosomes are initialized to random subsets.
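
A minimal sketch of the two-part chromosome and its fitness function is given below. The names are hypothetical, the `dissimilarity` callable stands in for whatever product-level dissimilarity measure is in use, and the sum of pairwise dissimilarities is divided by the number of pairs (an assumed normalisation) so that libraries of different sizes are comparable.

```python
from itertools import combinations, product


def enumerate_library(chromosome):
    """Expand a (rows, columns) chromosome into its n_1 * n_2 products."""
    rows, cols = chromosome
    return list(product(rows, cols))


def library_diversity(chromosome, dissimilarity):
    """Fitness: mean pairwise dissimilarity of the enumerated library C*."""
    products = enumerate_library(chromosome)
    pairs = list(combinations(products, 2))
    return sum(dissimilarity(p, q) for p, q in pairs) / len(pairs)


def toy_dissimilarity(p, q):
    """Toy measure for demonstration: fraction of reactants that differ."""
    return ((p[0] != q[0]) + (p[1] != q[1])) / 2
```

Each product is identified here by its pair of reactant indices; in practice `dissimilarity` would look up the descriptors of the two enumerated product molecules.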

The genetic operators of one-point crossover and mutation
are implemented. In each case, duplicate entries in either
half of the chromosome are forbidden. Mutation involves
altering some elements of the chromosome to new elements.
The mutation operation is equivalent to exchanging a reactant
in one selected subset with a different one from the relevant
pool. This is equivalent to exchanging a row or a column
from the top left-hand corner of the matrix with a row or a
column that appears lower in the matrix. Crossover creates
two new child chromosomes and is applied to one part of
the parent chromosome only, i.e., to one of the reactant pools.
A crossover point is chosen at random, and the reactants in
that subset in each parent are exchanged after that crossover
point, provided that they do not result in duplicate entries in
either half of the chromosome. This is equivalent to having
one of the subsets remain unchanged in each parent and
mixing the reactants between the parents in the other subsets.
As in the previous GA, the mutation operator is applied at
a higher rate than crossover (3:1) in order to reduce the
chances of the GA finding a local rather than a global
maximum. The GA is a steady-state, no-duplicates
algorithm.
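
The operators on the two-part chromosome might be sketched as follows. This is an illustrative reading of the description above, with hypothetical names; for simplicity the mutation shown changes a single element, and it assumes each selected subset is smaller than its pool and has at least two elements.

```python
import random


def mutate_library(chromosome, pool_sizes, rng):
    """Swap one selected reactant for an unused one from the same pool."""
    halves = [list(h) for h in chromosome]
    k = rng.randrange(len(halves))  # which reactant pool to alter
    half = halves[k]
    unused = [r for r in range(pool_sizes[k]) if r not in half]
    half[rng.randrange(len(half))] = rng.choice(unused)
    return halves


def crossover_library(parent_a, parent_b, rng):
    """One-point crossover applied to one half of the chromosome only."""
    k = rng.randrange(len(parent_a))  # the half that is mixed
    point = rng.randrange(1, len(parent_a[k]))
    child_a = [list(h) for h in parent_a]
    child_b = [list(h) for h in parent_b]
    for pos in range(point, len(child_a[k])):
        gene_a, gene_b = parent_a[k][pos], parent_b[k][pos]
        # Exchange only where no duplicate reactant would result.
        if gene_b not in child_a[k]:
            child_a[k][pos] = gene_b
        if gene_a not in child_b[k]:
            child_b[k][pos] = gene_a
    return child_a, child_b
```

The half that is not chosen for crossover is copied unchanged into each child, which corresponds to one reactant subset remaining fixed while the other is mixed between the parents.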

The GA was run on the libraries described previously. The
best results were obtained by repeatedly rerunning the GA
with a new starting population, consisting of the best 10%
of the chromosomes from the previous run, with the
remaining chromosomes initialized to random subsets. The
reactant pools each consisted of 400 reactants which, when
reacted together, generated libraries (C) of size 160 000.
DBCS was performed at the product level to select libraries
L_d* of size 1600. DBCS was also performed at the reactant
level to generate libraries, c, also containing 1600 molecules.
Finally, combinatorial libraries, C*, were selected from the
products and their diversities measured and compared with
D(c) and D(L_d*).

The results obtained are shown in Table 7. In all cases,
combinatorial libraries are selected that are significantly more
diverse than if compound selection is performed at the
reactant level. In fact, the combinatorial libraries are
intermediate in diversity between performing DBCS on the
reactants and performing DBCS on the products. Although
the combinatorial libraries selected using the GA are not as
diverse as the DBCS libraries that are selected from product
space, they are synthetically efficient, and the reagents
represented by them can be fed directly into combinatorial
synthesis experiments. This algorithm is therefore a significant
improvement over performing DBCS on the reactant
pools. Not only is it highly effective in operation but it is
also very efficient, with the selection of 40 x 40 reactant
pools from a 400 x 400 virtual library requiring approximately
20 min using a Silicon Graphics R10000 processor.

Table 7. Comparison of Results from Applying DBCS to the
Reactants and Enumerating Libraries, D(c), Applying DBCS to the
Fully Enumerated Product Library, D(L_d*), and Using the GA To
Select Diverse Combinatorial Libraries from the Products, D(C*)

library         size    D(L_d*)   D(C*)    D(c)
amides          1600    0.652     0.623    0.596
                 900    0.658     0.628    0.594
                 400    0.663     0.632    0.596
                 100    0.674     0.637    0.582
benzoic acid    1600    0.408     0.396    0.388
                 900    0.410     0.396    0.387
                 400    0.413     0.398    0.384
                 100    0.421     0.399    0.373
Kemp's acid     1600    0.435     0.419    0.401
                 900    0.436     0.422    0.401
                 400    0.440     0.422    0.403
                 100    0.448     0.419    0.396
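
One way to quantify "intermediate in diversity" is the fraction of the gap between reactant-level and product-level selection that the GA-selected combinatorial library recovers. The sketch below is an illustrative calculation rather than an analysis from the paper; it assumes that the three diversity values in each row of Table 7 are, in order, D(L_d*), D(C*), and D(c), and the function name is hypothetical.

```python
def gap_recovered(d_reactant, d_combinatorial, d_product):
    """Fraction of the D(c)-to-D(L_d*) gap that D(C*) recovers."""
    return (d_combinatorial - d_reactant) / (d_product - d_reactant)


# 1600-member libraries, values read from Table 7 (column order assumed):
rows = {
    "amides": (0.596, 0.623, 0.652),
    "benzoic acid": (0.388, 0.396, 0.408),
    "Kemp's acid": (0.401, 0.419, 0.435),
}
for name, (d_c, d_cstar, d_ld) in rows.items():
    print(f"{name}: {gap_recovered(d_c, d_cstar, d_ld):.2f}")
```

A value of 0 would mean no improvement over reactant-based selection, and a value of 1 would mean the combinatorial library is as diverse as the unconstrained DBCS product subset.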

DISCUSSION 

The diversity hypothesis has been tested for three different 
libraries and using two different structural representations.
In all cases it is seen that more diverse libraries result from
applying DBCS at the product level rather than at the reactant
level. Also the experiments designed to find the maximally 
diverse subset suggest that DBCS is near-optimal. It is 
concluded that the diversity hypothesis is not supported and 
that there is still considerable scope for improving the 
diversity of libraries resulting from applying compound 
selection to reactant pools. 

A significant limitation of performing DBCS at the product 
library level is that the resulting libraries do not represent 
"true" combinatorial libraries, that is, they require larger 
pools of reactants than if selection is performed at the reactant 
level, and the reactants are enumerated into products in all 
possible ways. We have hence developed an algorithm for 
selecting combinatorial libraries from product libraries that
are significantly more diverse than if selection is performed 
by analyzing reactant space. This is a practical solution that 
allows reactants to be selected by analyzing product space. 
The GA has been applied to optimize subset diversity 
according to one measure, that is, the sum of pairwise
dissimilarities of the selected molecules. However, it could 
also be applied to other diversity measures, for example 
partition-based measures that look to maximize the coverage 
of partitions by the selected molecules, if these could be 
performed sufficiently rapidly to form the fitness function 
for the GA. 

The limitations of the experiments must be emphasized. 
Firstly, the libraries considered contain just 160 000 mol- 
ecules, whereas much larger virtual libraries are easily 
conceived. Although the libraries studied represent different 
types of "reactions," ranging from a library where the 
structures do not have a common core to a library where all 
of the products share a large common core substructure, they
were all based (in whole or in part) on carboxylic acid
reactants, and it may be that different types of reactants lead 
to different results. 

Secondly, only one kind of subset-selection method was
considered, that of dissimilarity-based compound selection.
Other methods of selecting subsets include clustering,
partition-based selection, and D-optimal design. However,
these methods are much more computationally intensive and
it is not practical to apply them to large collections of
compounds. For example, Cribbs et al. 26 compared selection
at the virtual library level with reactant-based selection using
D-optimal design, clustering, and a uniform shell approach.
They concluded that the computational costs of selecting
even 900 compounds from 14 000 using these methods were
too high.

Another limitation of these experiments is that only one 
type of diversity measure has been considered; several other 
such measures have been reported, 6 -'- 16 . 28 and some of these 
alternatives might be applied to the testing of the diversity 
hypothesis or might be used as the fitness function of the 
GA. 

In conclusion, this paper has discussed the use of
dissimilarity-based compound selection (DBCS) to provide a
quantitative test of the diversity hypothesis. Experiments 
with several different combinatorial libraries show that 
reactant-based selection procedures result in sets of products 
that are intermediate in diversity between comparably-sized 
sets selected at random from a fully enumerated product 
library and selected from that library to maximize diversity. 
Thus, while reactant-based selection is an extremely efficient 
way of generating combinatorial libraries, it is less effective 
than, ideally, one might wish. Our results also suggest that 
existing algorithms for DBCS identify subsets that are very 
close in diversity to optimally dissimilar subsets. Finally,
we have described an algorithm for selecting combinatorial 
libraries from enumerated product libraries that are signifi- 
cantly more diverse than those generated using reactant-based 
selection. 